Final Project
Baseball Statistics
[Fall 2015] DataVisualizations, Cannata
12/7/15

Brian Bargas, German Martinez, Chris Pena

Data:

Our Data was obtained from the Sean Lahman Online Baseball Database. It contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. Specifically we took the master, hall of fame, and xxx cvs.

Once we had this csv chosen we reformatted them using the ETL.R scripts then imported them into our oracle database.

The csv we used were; Batting Hall of Fame Master Salary Team TeamFrancise

source("~/Documents/Academic/CompSci/Data/FinalProject/DV_FinalProject/00 Doc/filesummary.R", echo = TRUE)
## 
## > require(tidyr)
## 
## > require(dplyr)
## 
## > require(ggplot2)
## 
## > setwd("~/Documents/Academic/CompSci/Data/FinalProject/DV_FinalProject/01 Data")
## 
## > file_path <- "Batting.reformatted.csv"
## 
## > df <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df)
##    playerID             yearID         stint          teamID         
##  Length:99846       Min.   :1871   Min.   :1.000   Length:99846      
##  Class :character   1st Qu.:1932   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :1971   Median :1.000   Mode  :character  
##                     Mean   :1963   Mean   :1.077                     
##                     3rd Qu.:1996   3rd Qu.:1.000                     
##                     Max.   :2014   Max.   :5.000                     
##                                                                      
##    lgID               G                AB              R         
##  Mode:logical   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  NA's:99846     1st Qu.: 13.00   1st Qu.:  7.0   1st Qu.:  0.00  
##                 Median : 35.00   Median : 58.0   Median :  5.00  
##                 Mean   : 51.46   Mean   :150.6   Mean   : 19.98  
##                 3rd Qu.: 81.00   3rd Qu.:252.0   3rd Qu.: 30.00  
##                 Max.   :165.00   Max.   :716.0   Max.   :192.00  
##                                  NA's   :5149    NA's   :5149    
##        H               X2B              X3B               HR        
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 11.00   Median : 2.000   Median : 0.000   Median : 0.000  
##  Mean   : 39.43   Mean   : 6.654   Mean   : 1.385   Mean   : 2.944  
##  3rd Qu.: 64.00   3rd Qu.:10.000   3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :262.00   Max.   :67.000   Max.   :36.000   Max.   :73.000  
##  NA's   :5149     NA's   :5149     NA's   :5149     NA's   :5149    
##       RBI               SB                CS               BB        
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.00  
##  Median :  4.00   Median :  0.000   Median : 0.000   Median :  3.00  
##  Mean   : 18.04   Mean   :  3.182   Mean   : 1.337   Mean   : 13.88  
##  3rd Qu.: 27.00   3rd Qu.:  2.000   3rd Qu.: 1.000   3rd Qu.: 20.00  
##  Max.   :191.00   Max.   :138.000   Max.   :42.000   Max.   :232.00  
##  NA's   :5573     NA's   :6449      NA's   :28603    NA's   :5149    
##        SO              IBB              HBP               SH        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  2.00   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 10.00   Median :  0.00   Median : 0.000   Median : 1.000  
##  Mean   : 21.57   Mean   :  1.23   Mean   : 1.114   Mean   : 2.486  
##  3rd Qu.: 30.00   3rd Qu.:  1.00   3rd Qu.: 1.000   3rd Qu.: 3.000  
##  Max.   :223.00   Max.   :120.00   Max.   :51.000   Max.   :67.000  
##  NA's   :12987    NA's   :41712    NA's   :7959     NA's   :11487   
##        SF             GIDP       
##  Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 0.00   1st Qu.: 0.000  
##  Median : 0.00   Median : 1.000  
##  Mean   : 1.16   Mean   : 3.225  
##  3rd Qu.: 2.00   3rd Qu.: 5.000  
##  Max.   :19.00   Max.   :36.000  
##  NA's   :41181   NA's   :31257   
## 
## > file_path <- "HallOfFame.reformatted.csv"
## 
## > df2 <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df2)
##    playerID             yearid       votedBy             ballots     
##  Length:4088        Min.   :1936   Length:4088        Min.   : 78.0  
##  Class :character   1st Qu.:1950   Class :character   1st Qu.:226.0  
##  Mode  :character   Median :1964   Mode  :character   Median :274.0  
##                     Mean   :1969                      Mean   :319.7  
##                     3rd Qu.:1986                      3rd Qu.:425.0  
##                     Max.   :2015                      Max.   :581.0  
##                                                       NA's   :193    
##      needed          votes          inducted           category        
##  Min.   : 59.0   Min.   :  0.00   Length:4088        Length:4088       
##  1st Qu.:175.0   1st Qu.:  2.00   Class :character   Class :character  
##  Median :206.0   Median : 10.00   Mode  :character   Mode  :character  
##  Mean   :243.3   Mean   : 50.52                                        
##  3rd Qu.:321.0   3rd Qu.: 63.00                                        
##  Max.   :436.0   Max.   :555.00                                        
##  NA's   :350     NA's   :193                                           
##  needed_note       
##  Length:4088       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##                    
## 
## > file_path <- "Master.reformatted.csv"
## 
## > df3 <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df3)
##    playerID           birthYear      birthMonth        birthDay    
##  Length:18589       Min.   :1820   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:1894   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :1935   Median : 7.000   Median :16.00  
##                     Mean   :1930   Mean   : 6.626   Mean   :15.62  
##                     3rd Qu.:1967   3rd Qu.:10.000   3rd Qu.:23.00  
##                     Max.   :1994   Max.   :12.000   Max.   :31.00  
##                     NA's   :145    NA's   :315      NA's   :472    
##  birthCountry        birthState         birthCity           deathYear   
##  Length:18589       Length:18589       Length:18589       Min.   :1872  
##  Class :character   Class :character   Class :character   1st Qu.:1941  
##  Mode  :character   Mode  :character   Mode  :character   Median :1966  
##                                                           Mean   :1963  
##                                                           3rd Qu.:1988  
##                                                           Max.   :2014  
##                                                           NA's   :9364  
##    deathMonth        deathDay     deathCountry        deathState       
##  Min.   : 1.000   Min.   : 1.00   Length:18589       Length:18589      
##  1st Qu.: 3.000   1st Qu.: 8.00   Class :character   Class :character  
##  Median : 6.000   Median :15.00   Mode  :character   Mode  :character  
##  Mean   : 6.486   Mean   :15.55                                        
##  3rd Qu.:10.000   3rd Qu.:23.00                                        
##  Max.   :12.000   Max.   :31.00                                        
##  NA's   :9365     NA's   :9366                                         
##   deathCity          nameFirst           nameLast        
##  Length:18589       Length:18589       Length:18589      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   nameGiven             weight          height          bats          
##  Length:18589       Min.   : 65.0   Min.   :43.00   Length:18589      
##  Class :character   1st Qu.:170.0   1st Qu.:71.00   Class :character  
##  Mode  :character   Median :185.0   Median :72.00   Mode  :character  
##                     Mean   :185.6   Mean   :72.24                     
##                     3rd Qu.:197.0   3rd Qu.:74.00                     
##                     Max.   :320.0   Max.   :83.00                     
##                     NA's   :872     NA's   :809                       
##     throws             debut            finalGame        
##  Length:18589       Length:18589       Length:18589      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    retroID            bbrefID         
##  Length:18589       Length:18589      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##                                       
## 
## > file_path <- "Salaries.csv"
## 
## > df4 <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df4)
##      yearID        teamID              lgID             playerID        
##  Min.   :1985   Length:24758       Length:24758       Length:24758      
##  1st Qu.:1993   Class :character   Class :character   Class :character  
##  Median :2000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2000                                                           
##  3rd Qu.:2007                                                           
##  Max.   :2014                                                           
##      salary        
##  Min.   :       0  
##  1st Qu.:  260000  
##  Median :  525000  
##  Mean   : 1932905  
##  3rd Qu.: 2199643  
##  Max.   :33000000  
## 
## > file_path <- "Teams.reformatted.csv"
## 
## > df5 <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df5)
##      yearID         lgID              teamID            franchID        
##  Min.   :1871   Length:2775        Length:2775        Length:2775       
##  1st Qu.:1919   Class :character   Class :character   Class :character  
##  Median :1962   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1954                                                           
##  3rd Qu.:1991                                                           
##  Max.   :2014                                                           
##                                                                         
##     divID                Rank              G             Ghome      
##  Length:2775        Min.   : 1.000   Min.   :  6.0   Min.   :44.00  
##  Class :character   1st Qu.: 2.000   1st Qu.:153.0   1st Qu.:77.00  
##  Mode  :character   Median : 4.000   Median :157.0   Median :80.00  
##                     Mean   : 4.119   Mean   :150.2   Mean   :78.43  
##                     3rd Qu.: 6.000   3rd Qu.:162.0   3rd Qu.:81.00  
##                     Max.   :13.000   Max.   :165.0   Max.   :84.00  
##                                                      NA's   :399    
##        W                L             DivWin             WCWin          
##  Min.   :  0.00   Min.   :  4.00   Length:2775        Length:2775       
##  1st Qu.: 66.00   1st Qu.: 65.00   Class :character   Class :character  
##  Median : 77.00   Median : 76.00   Mode  :character   Mode  :character  
##  Mean   : 74.68   Mean   : 74.68                                        
##  3rd Qu.: 87.00   3rd Qu.: 86.50                                        
##  Max.   :116.00   Max.   :134.00                                        
##                                                                         
##     LgWin              WSWin                 R                AB      
##  Length:2775        Length:2775        Min.   :  24.0   Min.   : 211  
##  Class :character   Class :character   1st Qu.: 612.0   1st Qu.:5124  
##  Mode  :character   Mode  :character   Median : 690.0   Median :5384  
##                                        Mean   : 681.9   Mean   :5138  
##                                        3rd Qu.: 764.0   3rd Qu.:5516  
##                                        Max.   :1220.0   Max.   :5781  
##                                                                       
##        H             X2B             X3B               HR       
##  Min.   :  33   Min.   :  3.0   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:1298   1st Qu.:193.0   1st Qu.: 31.00   1st Qu.: 41.0  
##  Median :1393   Median :230.0   Median : 41.00   Median :106.0  
##  Mean   :1346   Mean   :227.1   Mean   : 47.27   Mean   :100.5  
##  3rd Qu.:1468   3rd Qu.:270.0   3rd Qu.: 60.00   3rd Qu.:148.0  
##  Max.   :1783   Max.   :376.0   Max.   :150.00   Max.   :264.0  
##                                                                 
##        BB              SO               SB              CS        
##  Min.   :  0.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.:424.5   1st Qu.: 503.0   1st Qu.: 64.0   1st Qu.: 35.00  
##  Median :493.0   Median : 750.0   Median : 96.0   Median : 46.00  
##  Mean   :473.7   Mean   : 732.2   Mean   :112.6   Mean   : 48.97  
##  3rd Qu.:554.0   3rd Qu.: 962.0   3rd Qu.:142.0   3rd Qu.: 58.00  
##  Max.   :835.0   Max.   :1535.0   Max.   :581.0   Max.   :191.00  
##                  NA's   :120      NA's   :144     NA's   :859     
##       HBP               SF              RA               ER        
##  Min.   : 26.00   Min.   :24.00   Min.   :  34.0   Min.   :  25.0  
##  1st Qu.: 47.00   1st Qu.:39.00   1st Qu.: 608.5   1st Qu.: 499.0  
##  Median : 54.50   Median :44.00   Median : 688.0   Median : 590.0  
##  Mean   : 56.28   Mean   :44.92   Mean   : 681.9   Mean   : 570.2  
##  3rd Qu.: 64.00   3rd Qu.:50.00   3rd Qu.: 764.0   3rd Qu.: 666.0  
##  Max.   :103.00   Max.   :75.00   Max.   :1252.0   Max.   :1023.0  
##  NA's   :2325     NA's   :2325                                     
##       ERA              CG              SHO               SV       
##  Min.   :1.220   Min.   :  0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:3.335   1st Qu.: 15.00   1st Qu.: 6.000   1st Qu.: 9.00  
##  Median :3.820   Median : 46.00   Median : 9.000   Median :24.00  
##  Mean   :3.813   Mean   : 50.99   Mean   : 9.646   Mean   :23.46  
##  3rd Qu.:4.300   3rd Qu.: 78.00   3rd Qu.:13.000   3rd Qu.:38.00  
##  Max.   :8.000   Max.   :148.00   Max.   :32.000   Max.   :68.00  
##                                                                   
##      IPouts           HA            HRA             BBA       
##  Min.   : 162   Min.   :  49   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:4074   1st Qu.:1288   1st Qu.: 44.5   1st Qu.:427.0  
##  Median :4230   Median :1392   Median :108.0   Median :495.0  
##  Mean   :4019   Mean   :1345   Mean   :100.5   Mean   :474.1  
##  3rd Qu.:4341   3rd Qu.:1470   3rd Qu.:147.0   3rd Qu.:556.0  
##  Max.   :4518   Max.   :1993   Max.   :241.0   Max.   :827.0  
##                                                               
##       SOA               E               DP              FP        
##  Min.   :   0.0   Min.   : 47.0   Min.   : 18.0   Min.   :0.7600  
##  1st Qu.: 500.0   1st Qu.:117.0   1st Qu.:126.0   1st Qu.:0.9600  
##  Median : 729.0   Median :146.0   Median :145.0   Median :0.9700  
##  Mean   : 725.6   Mean   :187.3   Mean   :140.1   Mean   :0.9613  
##  3rd Qu.: 958.5   3rd Qu.:218.0   3rd Qu.:159.8   3rd Qu.:0.9800  
##  Max.   :1450.0   Max.   :639.0   Max.   :217.0   Max.   :0.9910  
##                                   NA's   :317                     
##      name               park             attendance           BPF       
##  Length:2775        Length:2775        Min.   :   6088   Min.   : 60.0  
##  Class :character   Class :character   1st Qu.: 524845   1st Qu.: 97.0  
##  Mode  :character   Mode  :character   Median :1125288   Median :100.0  
##                                        Mean   :1330952   Mean   :100.2  
##                                        3rd Qu.:1984406   3rd Qu.:103.0  
##                                        Max.   :4483350   Max.   :129.0  
##                                        NA's   :279                      
##       PPF          teamIDBR         teamIDlahman45     teamIDretro       
##  Min.   : 60.0   Length:2775        Length:2775        Length:2775       
##  1st Qu.: 97.0   Class :character   Class :character   Class :character  
##  Median :100.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.2                                                           
##  3rd Qu.:103.0                                                           
##  Max.   :141.0                                                           
##                                                                          
## 
## > file_path <- "TeamsFranchises.reformatted.csv"
## 
## > df6 <- read.csv(file_path, stringsAsFactors = FALSE)
## 
## > summary(df6)
##    franchID          franchName           active         
##  Length:120         Length:120         Length:120        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    NAassoc         
##  Length:120        
##  Class :character  
##  Mode  :character

Visualization:

This scatterplot was created to compare the number of ballots with the number of votes per year. The number of ballots are the number people who voted for the Hall of Fame during that specific year and the number of votes are the number of votes that a person has gained to be considered for the Hall of Fame. The green squares are the ballots for that specific year and the blue circles are specific people that have gained a certain number of votes. Also the color of the green squares are homeruns where the lighter the color the less number of homeruns and the darker the color the more number of homeruns. Furthermore, the color of the blue circles are the number of hits where the lighter the color the less number of hits the person had and the darker the color the more number of hits the person had. The reason that this visual is important is because it shows a positive correlation for both scatterplots and the color of the points get darker as the years go by which means that more people are voting for the Hall of Fame every year and players get more hits and homeruns than the year before. Finally, this shows that as the players get more hits and homeruns then more people are likely to vote for Hall of Fame.

Bar Chart 1: This barchart was created to look at the homerun average per hit of each team that was at bat at least 150k times. The bars were colored using a key performance indicator showing that any team who hit homeruns 10% or greater when getting a hit are blue while 5% to 10% are green and anything percent lower is orange. This shows that any team who were at bat higher than the average number of times at bat (386k) are only between 5% and 10% likely of getting a homerun when getting a hit. Furthermore, most of those below the average are either 10% and higher or less than 5% which is interesting how teams who had similar number of times at bat have a huge difference in number of homeruns.

Scatterplot 2: This scatterplot was created to compare the number of votes needed to be selected for the Hall of Fame with the number of games played per year. These two scatterplots show a positive correlation which is interesting because it introduces the possibiltly that as the number of games played over the years increase then the number of votes needed to be selected for the Hall of Fame also increase.

This visualization depicts the batting averages for each active team from 1950-2014 via boxplot. Each page details the average, median, upper and lower quartiles, as well as the outliars of each team’s batting average. This plot is interesting because it tells whether these values remained the same or fluctuated for each team as time passed by. For further analysis, this plot could be blended with the Teams csv in order to determine if the batting average for each team could be related to that team’s winning record (wins/games played).

Salary: In the csv for franchise we have a dataset of how much each franchise manages to spend on salary. This visualization was created to be able to compare how much of a correlation there is between salary spent and a teams winning ratio, a calculated field created from dividing the number of wins by total games. From this bar chart we can see that there is a strong correlation between them, which could mean several things. A casual relationship where the more money spent on a team the more likely they are t win, an effect relationship which means that because they win so much they therefore have the money to spend on salary. Or neither which would mean that both of them are being caused by a third unknown factor. But one thing is for certain is that they are both related.

Hall of Fame Geographic: In this visualization we obtained every baseball hall of famer and marked down they’re hometown on the map. Once there we color coded them by the persons category (ie. player, umpire, manager, etc). In the detail sheet we added the year they were added along with the name of the person. Something that can be observed about this graph is that it is very easy to see which areas of the United States have more often than not been prone to creating a baseball hall of Famer. One of the most notable observations is that the New England states have a much higher density in Baseball hall of famers than any other region, particularly the Midwest with little to now baseball legends.

Using the data from the franchise csv we are able to create a crosstab comparing the franchises over the course of several years with their winning ration, the calculated field mentioned earlier. Blending the franchise and team csv did this. In the scatter plot we are able to see whether a team had a winning record (>0.6) or a losing record (<0.4), color-coding it for ease of visibility. An observation we can make is many teams often got into streaks for several years. Either winning streaks or losing streaks would keep going, however this trend diminished by a big amount once more teams started being added to the league. One could say with increased competition the sport became harder to get on a certain type of streak.

Shiny:

Shiny App: Our Shiny App can be found here : https://bswass.shinyapps.io/04Shiny